In this paper, a hybrid method is introduced to improve the classification performance of
naive Bayes (NB) on mixed datasets and multi-class problems. The proposed method relies on a
similarity measure that is applied to the portions that NB does not classify correctly. Since the
data contain a multi-valued short text with rare words that limit NB performance, we employ an
adapted selective classifier based on similarities (CSBS) to overcome the NB limitations and to
include the rare words in the computation. This is achieved by transforming the formula from a
product of the probabilities of the categorical variable into their sum, weighted by the numerical
variable. The proposed algorithm was evaluated on card payment transaction data that contain the
transaction label (a multi-valued short text) and the transaction amount. Based on K-fold
cross-validation, the evaluation results confirm that the proposed method achieves better
precision, recall, and F-score than the NB and CSBS classifiers used separately. In addition,
converting the product form into a sum gives rare words more opportunity to contribute to the text
classification, which is another advantage of the proposed method.

1. INTRODUCTION 

In many cases, datasets consist of both numerical and categorical variables. Many algorithms, such as
linear regression, support vector regression, and k-nearest neighbours (KNN), are well defined and validated for
the computation of numerical variables. For these algorithms, it is easier to establish the relations between a
target and its predictors when both are numerical. However, numerical operations are not applicable to
categorical variables unless they have been converted to numerical ones using coding systems such as dummy
coding, effects coding, or contrast coding [1, 2]. Another approach is based on similarity and dissimilarity
measures between categorical and numerical variables, where the data matrix is transformed into a distance
configuration matrix after applying similarity or dissimilarity functions [3-5].
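
As an illustration of the coding systems mentioned above, the following minimal Python sketch applies dummy coding to a single categorical variable. The function name `dummy_code`, the choice of the first sorted level as the reference category, and the sample values are assumptions made for this example; they are not taken from the cited works.

```python
def dummy_code(values):
    """Dummy-code one categorical variable.

    Returns the kept levels and one 0/1 row per input value.
    The first level in sorted order is used as the reference
    category and gets no column (an assumption of this sketch).
    """
    levels = sorted(set(values))
    kept = levels[1:]  # levels[0] is the reference category
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# Hypothetical payment-type variable with three categories.
columns, coded = dummy_code(["cash", "card", "transfer", "card"])
print(columns)
print(coded)
```

A variable with k distinct categories yields k-1 indicator columns, which is why the number of predictors grows quickly when the categorical variables are numerous.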

However, these approaches increase the number of predictors when the categorical variables
are numerous. In this case, additional steps have been proposed for the coding systems to reduce the number of
predictors [6, 7]. These approaches still do not apply to multi-valued categorical variables that contain more
than a single word. For such text, Mikolov et al. proposed the Word2Vec model, which represents the text in a
vector format and preserves the syntactic and semantic meaning of natural language [8, 9]. Word2Vec is
applicable even to disordered multi-word text in which linguistic and semantic rules are not respected.

In the pre-processing and classification context, some approaches based on similarity measures
apply cosine and string similarity to measure the distance between vectors. Other
approaches propose fully hybrid classifiers that depend on similarity-based measures. In this context, the
similarity-based classifier (SBC) [10] and the selective classifier based on similarities (CSBS) are two
algorithms that combine measures of equality, reliability, and density to classify vectors. Both classifiers
show excellent performance in text classification [11, 12].

On the other hand, naïve Bayes (NB) remains highly useful for classifying categorical and numerical
variables [13], especially when its performance is compared with that of other classifiers. In general,
identifying suitable similarity measures between categorical variables, or between categorical and numerical
variables, is considered a complex challenge. To address this challenge, a hybrid NB model has been constructed
using an adapted CSBS. The categorical variable is a short text, to which we apply tokenization and stop-word
removal in the pre-processing phase. For classification, NB is trained on the categorical variable only.
For the portions that are poorly explained by NB, the adapted CSBS intervenes in a second
phase to improve the classification by including the numerical variable.

The paper is organized as follows. Section 2 briefly presents the related work.
Section 3 describes the methods used in this study. Section 4 describes the
proposed hybrid naïve Bayes algorithm. Section 5 presents the experimental results of applying the algorithms to
the real credit card dataset. The last section presents the concluding remarks.


2. LITERATURE REVIEW 
2.1. Categorical variable and similarity measures 

Categorical and qualitative multi-valued data have been studied for a long time in different contexts.
Computing similarity has a long history, starting with the chi-square statistic in the late 1800s, which is
frequently used for independence tests between categorical variables. Pearson's chi-square has since seen many
improvements that handle several data similarity cases [14]. Meanwhile, classical categorical data have changed.
Notably, the number of categories of a qualitative variable has grown considerably. Categorical variables have
also started to include multi-valued short text [10], which exposes many limitations. Fortunately, different
methods based on similarity measures have been proposed to overcome this challenge. However, the performance of
those methods depends largely on the data characteristics [15].

For the main data characteristics, we consider a categorical data set containing N objects and p
categorical variables, where $A_k$ denotes the $k$-th variable, $Q_k$ the set of distinct values of $A_k$, and
$n_k$ its cardinality. The key characteristics are the following:

— $f_k(x)$: the number of times the attribute $A_k$ takes the value $x$ in the data set.
— $\hat{p}_k(x)$: the sample probability that $A_k$ takes the value $x$ in the data set, given by

$\hat{p}_k(x) = \frac{f_k(x)}{N}$    (1)

— $p_k^2(x)$: another probability estimate that $A_k$ takes the value $x$ in the data set, given by

$p_k^2(x) = \frac{f_k(x)\,(f_k(x) - 1)}{N\,(N - 1)}$    (2)


In general, to measure the similarity between two data instances X and Y belonging to a data set, all the
measures used take the following form:

$S(X, Y) = \sum_{k=1}^{p} w_k\, S_k(X_k, Y_k)$    (3)

$S_k(X_k, Y_k)$: the per-attribute similarity between two values of the categorical attribute $A_k$.

$w_k$: the weight assigned to the attribute $A_k$; hereafter, it is fixed to 1/p.

The above expression has been the subject of many studies and takes different forms depending
on the data. Some examples of $S_k(X_k, Y_k)$ and $w_k$ are given here. Starting with the simplest one, the
overlap measure counts the number of attributes that match in the two data instances, using (4):

$S_k(X_k, Y_k) = \begin{cases} 1 & \text{if } X_k = Y_k \\ 0 & \text{if } X_k \neq Y_k \end{cases}$    (4)


The Goodall4 measure aims to normalize the similarity between two objects, based on the probability that
the observed similarity value could be obtained from a random sample of two points [16]:

$S_k(X_k, Y_k) = \begin{cases} p_k^2(X_k) & \text{if } X_k = Y_k \\ 0 & \text{otherwise} \end{cases}$    (5)


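The quantities defined in this subsection can be made concrete with a short Python sketch. It computes the value counts $f_k(x)$, the estimate $p_k^2(x)$ of (2), the overlap measure of (4), and the aggregate similarity of (3) with $w_k = 1/p$. The Goodall4 function uses the commonly cited form of that measure ($p_k^2(x)$ on a match, 0 otherwise), which is an assumption of this sketch, as are all function names and the toy data set.

```python
from collections import Counter


def p2(data, k, x):
    """Estimate p_k^2(x) = f_k(x) (f_k(x) - 1) / (N (N - 1)); needs N >= 2."""
    n = len(data)
    f = Counter(row[k] for row in data)[x]  # f_k(x)
    return f * (f - 1) / (n * (n - 1))


def overlap(xk, yk, data, k):
    """Overlap measure: 1 on a match, 0 otherwise (data and k are unused)."""
    return 1.0 if xk == yk else 0.0


def goodall4(xk, yk, data, k):
    """Goodall4 in its commonly cited form (assumed here)."""
    return p2(data, k, xk) if xk == yk else 0.0


def similarity(x, y, data, per_attribute):
    """Aggregate similarity S(X, Y) with equal weights w_k = 1/p."""
    p = len(x)
    return sum(per_attribute(x[k], y[k], data, k) for k in range(p)) / p


# Hypothetical data set: N = 4 objects, p = 2 categorical attributes.
data = [("card", "food"), ("card", "travel"), ("cash", "food"), ("card", "food")]
s_overlap = similarity(data[0], data[1], data, overlap)
s_goodall = similarity(data[0], data[1], data, goodall4)
print(s_overlap, s_goodall)
```

In this toy example the two instances match only on the first attribute, so the overlap similarity is 1/2, while Goodall4 scales that match by how frequent the shared value is in the data.
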
2.2. Bank customer transactions classification 

Customer classification and targeting are widely applied in practice. In recent years, banks have
invested in their data and applied machine learning methods for customer identification, achieving
fruitful results. Eskin et al. [17] propose the use of a random sampling method to improve the support vector
machine (SVM) model for bank customer churn prediction. In the same context, De Caigny et al. [18, 19]
suggested combining logistic regression and decision trees. For fraud detection,
Jurgovsky et al. showed how a long short-term memory (LSTM) network that incorporates transaction sequences
improves detection accuracy compared with a random forest classifier [20]. Others focus on the pre-processing
part for credit applications, where various information about payments appears in qualitative, categorical
attributes. In general, the classification of customer transactions could be used to extend a system that
computes the socioecological impact of categorized transactions and provides more analysis of the
community and its relationship with the geographic location. It is also used by banks in risk management,
security and fraud detection, and in commercial departments to identify customer behaviour.


2.3. Text classification 

Text classification is a fundamental task in natural language processing. It is widely applied in
sentiment analysis, recommendation, and fraud and spam detection [21, 22]. Machine learning includes many
approaches to text classification, such as NB, support vector machines, and other algorithms. Lately, deep
learning has outperformed traditional machine learning methods, notably with
convolutional neural networks (CNNs) [23], recurrent neural networks (RNNs), and
the combination of CNNs and RNNs [24].

Although these methods have shown great success in processing long sentences, this has not been the case
for short text, which is explained by the data sparsity problem. Recently, many works have applied various text
representation models to extract more information from short text [25, 26]. As mentioned earlier, some are based
on features from multiple aspects, and others are based on transforming words into vectors. However, text
representations still face the data sparsity problem when the data include many new and rare words [27]. In
our case, the text in question is a short text whose variable takes a very large number of values, so the
new and rare words cause a serious classification problem. In this paper, we propose a hybrid NB classifier
based on adapted similarity measures, applied to card payment transaction data.


3. RESEARCH METHOD 
3.1. Naive Bayes classifier 

Naive Bayes is a supervised learning algorithm based on probabilistic classification.
This classifier is very fast compared with other methods. NB calculates the joint probabilities of
words and categories to estimate the category to which a text should be assigned. The term ‘naive’ refers to the
assumption that the words are independent. In other words, the conditional probability of a word given a
category is assumed to be independent of the conditional probabilities of the other words given the same
category [28].
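
A minimal sketch of this classifier on tokenized short texts is given below. It sums log-probabilities rather than multiplying probabilities, and it uses add-one (Laplace) smoothing so that unseen words do not produce a zero probability; the smoothing choice, the function names, and the toy labels are assumptions of this example, not details stated in the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the most probable class and the log-score of every class."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n_label / total)  # log prior
        for word in tokens:
            # Independence assumption: word likelihoods simply add in log space.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical training labels, loosely modelled on transaction texts.
docs = [["achat", "marjane"], ["achat", "bigdil"],
        ["facture", "eau"], ["facture", "telecom"]]
labels = ["shopping", "shopping", "bills", "bills"]
model = train_nb(docs, labels)
print(predict_nb(model, ["achat", "marjane"])[0])
print(predict_nb(model, ["facture", "kinani"])[0])
```

The second query contains a word ("kinani") that never occurs in training; it contributes the same smoothed factor to every class, so the decision rests entirely on the one known keyword.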


3.2. CSBS classifier 

The CSBS is a classifier based on similarity measures that addresses the limitations observed in
short text classification using three measures: equality, reliability, and density [10]. For
notation, for a class C we distinguish between the amplitude $a_j^{C}$ and the own amplitude $a_j^{C*}$ of an
attribute j. The own amplitude of a given attribute indicates how reliable that attribute is relative to the
other attributes; it is obtained by eliminating from $a_j^{C}$ the intervals that contain values belonging to
the other classes.

In the CSBS classifier [11], equality is measured by the number of objects sharing the same values per
attribute. The higher the measure is, the more the values indicate membership of the class. The
own amplitude, in turn, indicates the reliability of the attribute: an instance is more likely to belong to
a class when the attribute value is included in the own amplitude of that class. The density of the membership
of an instance to a class C is measured using (6):


$f_{iC} = \frac{1}{M} \sum_{j=1}^{M} p_{jC} \times \frac{N_j^{C} + a_j^{C*} - d(x_{ij}, c_j^{C})}{N_j + a_j^{C} + \varepsilon}$    (6)


where: $M$: the number of attributes.

$N$: the number of instances.

$p_{jC}$: the coefficient of reliability of $x_j$ to predict C.

$N_j^{C}$: the number of instances of C that take the value of the processed instance on the $j$-th attribute.

$a_j^{C*}$: the own amplitude of C for the $j$-th attribute.

$a_j^{C}$: the simple amplitude of C for the $j$-th attribute.

$c_j^{C}$: the center of C for the $j$-th attribute.

$N_j$: the number of instances that take the value of the processed instance on the $j$-th attribute.

$\varepsilon$: a very small positive value.
Finally, the class of a given instance is the one with the highest membership measure, using (7):

$y_{instance} = \arg\max_{C \,\in\, \text{classes group}} f_{iC}$    (7)


4. THE PROPOSED METHOD 
The primary purpose of the proposed algorithm is to provide a new hybrid algorithm that performs
better on mixed data. This algorithm combines the individual strengths of NB for text applications and of CSBS.
It mitigates the disadvantages of the two methods, given that the performance of NB drops as the
number of rare words increases. In addition, it has several advantages, described as follows:
— By combining a probabilistic algorithm with an algorithm based on distance and density, the model
is no longer bound to a purely probabilistic formulation.
— The computational complexity is lower than that of the NB model, as the proposed classifier turns the product
form into a sum form.
— The rare words are no longer ignored, since they become a means of improving classification
performance.
— The CSBS contains a normalized distance, which is better suited to numerical variables.
— The implementation is simpler.
These advantages can be seen in the description of the algorithm shown in
Figure 1. The process shows the main steps used to overcome the cases in which NB fails to classify a particular
instance, and the combination with the adapted CSBS at a specific stage. To illustrate the logic of the proposed
model, Figure 2 represents how the different components are handled at each level. The number of trials is based
on the value of K. For each trial, NB classifies the text instances based on the occurrence of words and the
class membership probabilities. However, due to the high number of rare words, NB assigns a significant
portion of the instances to the wrong class. By adding the weight of the numerical dimension, the adapted CSBS
tries to improve the classification and to promote the contribution of each word in the dataset.


*Pre-processing:

1: Let X-matrix denote the data with M predictors: X-matrix = [x_1, x_2, ..., x_M], where p predictors are
categorical and M-p are numerical, and the class label is y.

2: Clean and pre-process the categorical part X'-matrix = [x'_1, x'_2, ..., x'_p] by removing the stop words
and building the bag of words.

*Training:

3: Obtain the training set instances.

4: Train the NB model with the training dataset.

*Classification:

5: Obtain the test dataset.

6: Calculate the class membership probabilities of each instance, using only the categorical part of the dataset.

7: Select the X-matrix subset corresponding to the instances left unclassified in the step above, and apply the
adapted CSBS according to formula (8).

*Evaluation:

8: Evaluate the model based on the performance of both NB and the adapted CSBS results, using K-fold
cross-validation.

Figure 1. The proposed algorithm
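
The two-stage logic of the classification phase can be sketched in Python as follows. The paper does not state the exact rule used to decide that NB has failed on an instance, so this sketch assumes a confidence threshold on the normalized NB posterior; the threshold value, the function name `hybrid_predict`, and the `fallback` callable standing in for the adapted CSBS are all hypothetical.

```python
import math


def hybrid_predict(nb_log_scores, fallback, threshold=0.6):
    """Return (label, stage) for one instance.

    nb_log_scores maps each class to its NB log-score. If the normalized
    posterior of the best class reaches the threshold, NB decides;
    otherwise the instance is handed to the second-stage classifier.
    """
    top = max(nb_log_scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {c: math.exp(s - top) for c, s in nb_log_scores.items()}
    norm = sum(weights.values())
    posteriors = {c: w / norm for c, w in weights.items()}
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] >= threshold:
        return best, "nb"
    return fallback(), "csbs"


# A confident NB decision is kept; an ambiguous one is deferred.
confident = hybrid_predict({"a": math.log(0.9), "b": math.log(0.1)}, lambda: "b")
ambiguous = hybrid_predict({"a": math.log(0.5), "b": math.log(0.5)}, lambda: "b")
print(confident, ambiguous)
```

Keeping the second stage behind a callable means the numerical variable is only processed for the instances that NB cannot settle on its own.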

Figure 2. Illustration of the different stages of the proposed algorithm


5. RESULTS AND ANALYSIS 
5.1. Experiments 
5.1.1. Data description and preparation 

The aim of our proposed solution is to effectively handle mixed data for card payment transaction
classification problems. The dataset contains 1312 instances and two variables. The first
is a categorical variable that describes the transaction labels. The second is a numerical variable that consists
of the amount associated with each operation. We extracted the data from a personal account held at a
Moroccan bank, and we aim to classify the transactions into four classes.

In our dataset, the categorical variable is an unstructured text that does not strictly respect
the syntax or the semantics of natural language (English, French...), or any abbreviation rules. Moreover,
the position of a word in a sentence has no importance. In some cases the variable can be treated as a
normal categorical dimension with few values, in other cases it is multi-valued, and it may also be
classed as short text. Table 1 presents each case with some selected instances.

The preparation of such data involves three parts: tokenization, removal of stop words, and
construction of the bag of words. To tokenize the text of the categorical variable, the text strings are split
into words, and the stop words (for example: the, and, or...) are identified and removed. Stop words can
also be a specified list of expressions. For example, in the label “Supermarket EL JADIDA”, the expression
“EL JADIDA”, which is the name of a Moroccan city, carries no information for our proposed model, so our list of
stop words combines the standard French and English stop-word lists with the list of all Moroccan cities.
Finally, the bag of words is constructed as a matrix, which helps the classifier to train on the data and
to recover the significant terms of each class.
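
The three preparation steps can be sketched with the Python standard library as shown below. The stop-word set is a tiny illustrative stand-in for the combined French, English, and Moroccan city lists, and the sketch removes city names token by token rather than as multi-word expressions; both choices, like the function names, are assumptions of this example.

```python
import re
from collections import Counter

# Illustrative stand-in for the full stop-word list described in the text.
STOP_WORDS = {"the", "and", "or", "de", "la", "le", "el", "jadida"}


def tokenize(label):
    """Lower-case a transaction label and split it into word tokens."""
    return re.findall(r"\w+", label.lower())


def preprocess(label, stop_words=STOP_WORDS):
    """Tokenize a label and drop the stop words."""
    return [token for token in tokenize(label) if token not in stop_words]


def bag_of_words(labels):
    """Build the vocabulary and the document-term count matrix."""
    docs = [Counter(preprocess(label)) for label in labels]
    vocab = sorted({token for doc in docs for token in doc})
    matrix = [[doc[token] for token in vocab] for doc in docs]
    return vocab, matrix


vocab, matrix = bag_of_words(["Supermarket EL JADIDA", "Pay Marjane bill"])
print(vocab)
print(matrix)
```

Each row of the matrix corresponds to one transaction label and each column to one retained term, which is the input format expected by a count-based NB classifier.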


Table 1. Different cases selected from the payment transaction text variable
Case                                Payment transaction text          Comment
Standard categorical dimension      “Achat YVES ROCHER MAROC”         Each instance belongs to a different class and
                                    “Achat via WWW.ALIEXPRESS.COM”    appears in a single form in the whole dataset.
                                    “Pay UBER MAROC E-COM bill”

Multi-value categorical dimension   “Achat Marjane market Alina”      All instances belong to the same class; however,
                                    “Achat Marjane Bigdil”            the third one will be misclassified by NB.
                                    “Pay Marjane bill”

Short text                          “Bill L'ARBRE DE ZOE”             Rare words are highly represented in this sample;
                                    “Facture KINANI CHAUSSURES”       the only keywords are “bill” and “facture”, and
                                    “GRAS SAVOYE Molay Youssef”       both are not enough to obtain a correct
                                                                      classification with NB.





5.1.2. Experimental procedures 

To evaluate the proposed algorithm, we trained three models. The first is NB, which was applied
to the categorical variable only, to avoid interference from the numerical variable. The second model used the
adapted CSBS on both the categorical and numerical variables. The last one is our proposed model, which
combines NB and the adapted CSBS algorithm. To adjust the CSBS given in (6) to the structure of the
dataset, the adapted CSBS is defined by (8):





$S_{iC}(X) = \frac{N_t^{C} + a_t^{C*} - d(x_{it}, c_t^{C})}{N_t + a_t^{C} + \varepsilon} \times \frac{1}{M'} \sum_{j=1}^{M'} p_{jC}$    (8)


where: $p_{jC}$: the frequency of the word $w_j$ per class C.
$t$: the index of the parameters of the numerical attribute.
$M'$: the number of words of the categorical variable.
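
As a worked illustration, the sketch below evaluates an adapted CSBS score of this kind for one instance and two candidate classes. It assumes that (8) is the product of a numerical term, (N_t^C + a_t^C* - d) / (N_t + a_t^C + epsilon), and the mean of the per-class word frequencies; this reading of the formula, the function names, and every numeric value are assumptions of the example. The amplitudes, center distance, and counts are taken as given inputs, since their computation is not detailed here.

```python
import math


def adapted_csbs_score(word_freqs, n_tc, own_amp, amp, dist, n_t, eps=1e-9):
    """Score one class: numerical term times the mean word frequency."""
    if not word_freqs:
        return 0.0  # no words left after pre-processing
    numeric_term = (n_tc + own_amp - dist) / (n_t + amp + eps)
    text_term = sum(word_freqs) / len(word_freqs)  # sum form, not a product
    return numeric_term * text_term


def classify(candidates):
    """Pick the class with the highest adapted CSBS score."""
    return max(candidates, key=lambda c: adapted_csbs_score(**candidates[c]))


# Hypothetical instance with two words, the second unseen in both classes.
candidates = {
    "shopping": dict(word_freqs=[0.5, 0.0], n_tc=3, own_amp=40.0,
                     amp=50.0, dist=10.0, n_t=4),
    "bills": dict(word_freqs=[0.1, 0.0], n_tc=1, own_amp=5.0,
                  amp=100.0, dist=60.0, n_t=4),
}
product_form = math.prod(candidates["shopping"]["word_freqs"])
sum_form = sum(candidates["shopping"]["word_freqs"]) / 2
print(classify(candidates), product_form, sum_form)
```

The comparison at the end shows the effect of replacing the product by a sum: a single word with zero frequency drives the product to zero, whereas the averaged sum still reflects the evidence carried by the other word.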

For a fair comparison, we organized the dataset into subsets of different sizes, n=280, 560, 840,
and 1120, each selected arbitrarily from our dataset of 1312 instances. The
K-fold cross-validation sampling method is frequently used to evaluate models in machine learning and data
mining. The dataset is segmented randomly into K segments; each segment is held out once while the
classifier is trained on the other K-1 segments. In our case, K takes the values 4, 7, and 10.
The learning procedure is therefore performed K times on each subset. The overall performance is
evaluated in terms of recall, precision, and F-measure:








$Precision = \frac{TP}{TP + FP}$    (9)

$Recall = \frac{TP}{TP + FN}$    (10)

$F\_score = 2 \times \frac{Recall \times Precision}{Recall + Precision}$    (11)

where: FN is the number of false negatives.

FP is the number of false positives.

TP is the number of true positives.

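Computing these measures in a multi-class setting requires the following notions. Let C be the confusion
matrix over the n classes, whose rows correspond to the actual classes and whose columns correspond to the
classified classes:

$C = \begin{pmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nn} \end{pmatrix}$


The confusion elements for each class $i$ are given by:


$tp_i = c_{ii}; \quad fn_i = \sum_{k=1}^{n} c_{ik} - tp_i$    (12, 13)
$fp_i = \sum_{k=1}^{n} c_{ki} - tp_i$    (14)
$tn_i = \sum_{l=1}^{n} \sum_{k=1}^{n} c_{lk} - tp_i - fp_i - fn_i$    (15)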
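
The evaluation procedure of this subsection can be sketched as follows: a random K-fold split of the instance indices, and the per-class confusion elements with the resulting precision, recall, and F-score. The sketch assumes that rows of the confusion matrix hold the actual classes and columns the classified ones; the function names, the fixed random seed, and the 2x2 example matrix are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for K-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random segmentation
    folds = [indices[i::k] for i in range(k)]  # K nearly equal segments
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def per_class_metrics(confusion):
    """Per-class tp, fn, fp, tn, precision, recall, and F-score.

    confusion[i][k] counts the instances of actual class i classified as k.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                         # rest of row i
        fp = sum(confusion[k][i] for k in range(n)) - tp    # rest of column i
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results.append({"tp": tp, "fn": fn, "fp": fp, "tn": tn,
                        "precision": precision, "recall": recall,
                        "f_score": f_score})
    return results


splits = list(k_fold_indices(280, 7))
metrics = per_class_metrics([[8, 2], [1, 9]])
print(len(splits), metrics[0])
```

With n=280 and K=7, each of the seven iterations holds out 40 instances and trains on the remaining 240, and every instance is used for testing exactly once.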


5.2. Experimental results

The performance of our hybrid model was evaluated using the K-fold cross-validation
introduced in the section above. Since the parameter K took different values, we ran the model over 30
trials for each sample size. The results for the three classifiers (NB, adapted CSBS, and the proposed method)
are reported in Table 2. The improvements of the hybrid method across the different measures stem
first from the performance of naïve Bayes on the dataset, and second from the added performance of the adapted
CSBS applied to the poorly classified partitions. Furthermore, the notable role of the adapted CSBS
cannot be denied, since it kept a good harmonic mean between recall and precision in each
simulation, and an even better one when combined with NB. To present the progress of our
classifier in terms of multi-class classification, we randomly selected four trials for K=10 applied to a
sample of n=280. Based on Table 3, which reports the recall, precision, and F-score values, the
proposed method outperformed the other two on the three evaluation indicators.


Table 2. Results of the different classifiers for different values of K, averaged over 30 trials
                     Naive Bayes                  Adapted CSBS                 The proposed model
K      Sample size   Recall  Precision  F-score   Recall  Precision  F-score   Recall  Precision  F-score
K=4    280           0.63    0.76       0.62      0.78    0.79       0.83      0.79    0.89       0.89
       560           0.61    0.73       0.62      0.75    0.82       0.79      0.78    0.89       0.86
       840           0.72    0.71       0.71      0.83    0.89       0.77      0.88    0.93       0.86
       1120          0.76    0.68       0.72      0.76    0.75       0.72      0.89    0.89       0.94
K=7    280           0.71    0.75       0.64      0.78    0.84       0.75      0.84    0.92       0.93
       560           0.78    0.69       0.62      0.84    0.74       0.64      0.80    0.88       0.85
       840           0.63    0.79       0.72      0.65    0.87       0.63      0.83    0.94       0.83
       1120          0.67    0.71       0.74      0.74    0.85       0.71      0.98    0.91       0.89
K=10   280           0.60    0.61       0.62      0.77    0.89       0.62      0.88    0.80       0.88
       560           0.70    0.60       0.62      0.83    0.81       0.73      0.84    0.97       0.80
       840           0.76    0.80       0.71      0.74    0.74       0.66      0.77    0.96       0.89
       1120          0.78    0.67       0.72      0.72    0.84       0.77      0.90    0.88       0.94





Table 3. Precision, recall, and F-score per trial and per method
Trial     No.   Method            Recall  Precision  F-score
Trial 1   1     Naive Bayes       0.78    0.89       0.83
          2     Adapted CSBS      0.74    0.76       0.75
          3     Proposed method   0.89    0.94       0.91
Trial 2   4     Naive Bayes       0.90    0.85       0.88
          5     Adapted CSBS      0.78    0.77       0.78
          6     Proposed method   0.94    0.91       0.92
Trial 3   7     Naive Bayes       0.87    0.90       0.88
          8     Adapted CSBS      0.80    0.74       0.77
          9     Proposed method   0.93    0.94       0.94
Trial 4   10    Naive Bayes       0.83    0.85       0.84
          11    Adapted CSBS      0.77    0.75       0.76
          12    Proposed method   0.90    0.93       0.91




Moreover, the hybrid method ensures good performance on each individual class, so that for a class C
we have:
Precision(C_NB) < Precision(C_proposed method) and Recall(C_NB) < Recall(C_proposed method)


To visualize this improvement, a demonstration with confusion matrices is helpful. Figure 3 illustrates the
confusion matrices of the selected trials for each method. Moving from NB to the adapted CSBS to the proposed
method for each trial, the numbers on the diagonal of the confusion matrix increased while the numbers off the
diagonal decreased, which shows the progress of the per-class classification. We also note that the true
positives in tables (3), (6), (9), and (12) are better than their equivalents in tables (2), (5), (8), and (11).
This result highlights how the hybrid method works significantly better for rare words and achieves excellent
results for both mixed data classification and text classification. In general, NB shows good results compared
with those of CSBS. However, the combination of both achieves meaningful progress in classification.


Figure 3. Confusion matrices of four randomly selected trials, illustrating the results of Table 3


6. CONCLUSION 

The main objective of this contribution is to deal with the classification of mixed data that include
a multi-valued short text variable. We introduced a hybrid naive Bayes that is based on similarity measures to
effectively process both categorical and numerical variables. In the proposed method, naive Bayes predicts
the portion of the target that is explained by the categorical variable alone, and the remaining part is
predicted using the adapted CSBS, which provides good classification using the numerical variable. The proposed
solution thus combines NB with an adapted CSBS. The hybrid model was compared with naive Bayes and the adapted
CSBS used separately. The experiments were performed on card payment transaction data that contain a
multi-valued short text variable and a numerical variable. The solution achieved significant progress in
terms of recall, precision, and F-measure. Furthermore, it deals well with the rare words issue, which also
improves the classification performance of the model.

This work is limited in that it has not yet been applied to other well-known datasets. However, it
was proposed to handle the classification of short text using multi-valued variables, applied to a real-world
problem: card payment transaction classification. This study could be extended to many mixed datasets in
different fields in order to optimize the classification of categorical dimensions. In future work, the
dimensionality of the text vectors supported by our method will be investigated while maintaining its simplicity.